The "candy power ranking" dataset (provided by FiveThirtyEight, distributed under the CC0: Public Domain license, available on GitHub) contains 85 candy brands. Included are ten attributes (e.g., contains chocolate, caramel, or nougat), the price (as a percentile), and the win rate (as a percentage) from pairwise matchups.
The participants' task was to choose their preferred candy from a pairwise comparison. While the authors briefly explained the method of participant selection, no further details about the data collection are known. For the calculations, it is assumed that the design was balanced and that each candy was shown an equal number of times and paired against each other candy an equal number of times.
The objective of this analysis is twofold. The first goal is to identify the most important characteristics of the preferred sweets. The second is to turn the findings into a recommendation for action.
Description of the variables:
(taken from the official github README.md)
| Header | Description |
|---|---|
| chocolate | Does it contain chocolate? |
| fruity | Is it fruit flavored? |
| caramel | Is there caramel in the candy? |
| peanutyalmondy | Does it contain peanuts, peanut butter or almonds? |
| nougat | Does it contain nougat? |
| crispedricewafer | Does it contain crisped rice, wafers, or a cookie component? |
| hard | Is it a hard candy? |
| bar | Is it a candy bar? |
| pluribus | Is it one of many candies in a bag or box? |
| sugarpercent | The percentile of sugar it falls under within the data set. |
| pricepercent | The unit price percentile compared to the rest of the set. |
| winpercent | The overall win percentage according to 269,000 matchups. |
# importing needed packages
# importing packages needed system-wise
from pathlib import Path
import datetime as date
# mathematics & data manipulation tools
import pandas as pd
import numpy as np
# import plotting tools
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
# import scipy
import scipy.stats as stats
# ml models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.linear_model import SGDRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# import helper functions
import functions.functions as hp
# create path representation & grab current date
cwd = Path().cwd().parent
today = date.datetime.today().strftime("%Y-%m-%d")
# output path and date
print("working directory (please check if its correct for your working directory!!):")
print(cwd)
print()
print("date:", today)
working directory (please check if its correct for your working directory!!): /home/daniel/git/cookie_driver_analysis date: 2020-12-30
# import dataset
df = pd.read_csv([p for p in (cwd/"data").glob("*.csv")][0], sep=",")
# show if the data was imported correctly
df.head()
| competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 Grand | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.860 | 66.971725 |
| 1 | 3 Musketeers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 | 67.602936 |
| 2 | One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |
| 3 | One quarter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 | 46.116505 |
| 4 | Air Heads | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 | 52.341465 |
# check for data integrity
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -------
 0   competitorname    85 non-null     object
 1   chocolate         85 non-null     int64
 2   fruity            85 non-null     int64
 3   caramel           85 non-null     int64
 4   peanutyalmondy    85 non-null     int64
 5   nougat            85 non-null     int64
 6   crispedricewafer  85 non-null     int64
 7   hard              85 non-null     int64
 8   bar               85 non-null     int64
 9   pluribus          85 non-null     int64
 10  sugarpercent      85 non-null     float64
 11  pricepercent      85 non-null     float64
 12  winpercent        85 non-null     float64
dtypes: float64(3), int64(9), object(1)
memory usage: 8.8+ KB
# check if each sweet appears only once in the dataset
df["competitorname"].duplicated().value_counts() # --> all 85 brands are unique
False    85
Name: competitorname, dtype: int64
The dataset was imported and is valid. There are 85 brands and no missing values in the features. All features are numeric, except competitorname (which should be a string/object in this case). As we have already seen, winpercent is formatted as a natural percentage. To work on the same feature basis in the train and test splits, we should also divide winpercent by 100 to match the other percentage columns.
# safecopy original values and divide it by 100
df["winpercent_original"] = df["winpercent"].copy()
df["winpercent"] = df["winpercent"]/100
Before starting to work on the model, a train/test split is created.
# create train_test_split for final evaluation (fixed random_state for reproducibility)
X_train, X_test, Y_train, Y_test = train_test_split(df.iloc[:,1:-2], df.iloc[:,-2], test_size=.25, random_state=1)
Let's start by getting a feel for the data itself. In particular the distribution.
# plot the distribution
hp.plot_distribution(X_train, Y_train, "Distribution for each variable")
At first glance, winpercent looks close to normally distributed. However, because of the binning (for the plot), we should double-check it.
# further analysis of the dependent variable (winpercent)
print("Overview")
display(Y_train.describe())
Overview
count    63.000000
mean      0.499796
std       0.145527
min       0.234178
25%       0.391633
50%       0.471732
75%       0.593827
max       0.841803
Name: winpercent, dtype: float64
# closer look at winpercent
ff.create_distplot([Y_train],
["winpercent"],
bin_size=.1).update_layout(title_text="winpercent in detail",
height=600,
showlegend=False).show()
For many statistical tests, a normally distributed dependent variable is a must.
Visually, winpercent tends to be slightly skewed.
Nevertheless, better safe than sorry - let's verify with statistics.
# check normal distribution
hp.check_normal_distribution(Y_train)
shapiro-wilk: ShapiroResult(statistic=0.9759279489517212, pvalue=0.25275662541389465)
K²: NormaltestResult(statistic=2.4887828525438636, pvalue=0.2881161947346958)
anderson: test stat: 0.5
Is the test value smaller than the critical value for:
sig. lvl 15.0 : True
sig. lvl 10.0 : True
sig. lvl 5.0 : True
sig. lvl 2.5 : True
sig. lvl 1.0 : True
Shapiro-Wilk, K² and Anderson-Darling all fail to reject the null hypothesis (the p-values are greater than .05 and the test statistic stays below the critical values). Therefore, it is safe to assume a normally distributed dependent variable, and parametric models can be used for higher test power.
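The helper `hp.check_normal_distribution` is not reproduced here. A minimal sketch of what such a check could look like with `scipy.stats` (the function body below is an assumption, not the actual helper):

```python
import numpy as np
import scipy.stats as stats

def check_normal_distribution(y):
    """Run three common normality tests and return the raw results.

    Sketch of a possible implementation; the real helper may differ.
    """
    y = np.asarray(y)
    return {
        "shapiro": stats.shapiro(y),                 # Shapiro-Wilk
        "k2": stats.normaltest(y),                   # D'Agostino K^2
        "anderson": stats.anderson(y, dist="norm"),  # Anderson-Darling
    }

# example with synthetic, normally distributed data
rng = np.random.default_rng(42)
res = check_normal_distribution(rng.normal(size=200))
print(res["shapiro"].pvalue)  # large p-values do not reject normality
```

For Anderson-Darling there is no p-value; instead the test statistic is compared against the returned critical values per significance level, which matches the output format above.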
Let's get a feel for the relationship between each feature and the dependent variable, and also between the features themselves.
# check correlations between features and the dependent variable
hp.find_corr(
["chocolate", "fruity", "caramel", "peanutyalmondy", "nougat","crispedricewafer", "hard", "bar", "pluribus"],
["sugarpercent", "pricepercent"],
X_train,
Y_train
)
| chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| correlation | 0.65 | -0.4 | 0.04 | 0.46 | 0.24 | 0.33 | -0.32 | 0.44 | -0.28 | 0.23 | 0.35 |
| p-value | 0.00 | 0.0 | 0.75 | 0.00 | 0.06 | 0.01 | 0.01 | 0.00 | 0.03 | 0.06 | 0.01 |
With the exception of caramel, all features appear to be related to winpercent, with chocolate appearing to be by far the best predictor at the moment.
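`hp.find_corr` receives the binary and the continuous features separately, which suggests point-biserial correlations for the binary features and Pearson correlations for the continuous ones. A sketch of such a helper (the implementation below is an assumption about what the project helper does):

```python
import pandas as pd
import scipy.stats as stats

def find_corr(binary_cols, continuous_cols, X, y):
    """Correlate each feature with the target.

    Point-biserial for binary features, Pearson for continuous ones.
    Sketch of a possible implementation of hp.find_corr.
    """
    rows = {}
    for col in binary_cols:
        r, p = stats.pointbiserialr(X[col], y)
        rows[col] = (r, p)
    for col in continuous_cols:
        r, p = stats.pearsonr(X[col], y)
        rows[col] = (r, p)
    # one column per feature, rows = correlation and p-value
    return pd.DataFrame(rows, index=["correlation", "p-value"]).round(2)
```

Mathematically the point-biserial correlation is identical to Pearson's r with a 0/1 regressor, so splitting the two cases mainly documents the measurement levels.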
Before proceeding, we should check whether the features are correlated with one another. This information is important for the interpretation of the regression results and the evaluation of feature importance.
To check for multicollinearity, we can use the VIF score:
# calculate possible multicollinearity
VIF = hp.multicolinearity(X_train)
VIF
features are: ['chocolate', 'fruity', 'caramel', 'peanutyalmondy', 'nougat', 'crispedricewafer', 'hard', 'bar', 'pluribus', 'sugarpercent', 'pricepercent']
| VIF coef | |
|---|---|
| bar | 3.988282 |
| chocolate | 2.764410 |
| fruity | 2.706221 |
| nougat | 2.099627 |
| crispedricewafer | 1.835499 |
| pluribus | 1.690359 |
| pricepercent | 1.654920 |
| caramel | 1.485838 |
| peanutyalmondy | 1.445967 |
| hard | 1.280195 |
| sugarpercent | 1.225047 |
If the features are unrelated, the VIF coefficient should be 1.
In our case it's between 1 and 5 (with the highest values for bar, chocolate, fruity & nougat), so the features are moderately related.
Nevertheless, we should take this fact into account when interpreting the meanings of the characteristics.
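For reference, the VIF can also be computed by hand: regress each feature on all remaining features and take VIF_j = 1 / (1 - R²_j). A sketch (the function name `vif_scores` is made up and this may differ from `hp.multicolinearity`):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_scores(X):
    """Variance inflation factor per feature.

    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
    feature j on all remaining features. Sketch only.
    """
    scores = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        scores[col] = 1.0 / (1.0 - r2)
    return (pd.DataFrame.from_dict(scores, orient="index", columns=["VIF coef"])
              .sort_values("VIF coef", ascending=False))
```

Independent features end up near 1; strongly collinear features blow up quickly, which is why values above roughly 5-10 are usually treated as problematic.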
Another option is to use different methods to assess importance, such as Shapley values (e.g., via dominance analysis) or relative weights (for a great overview, please refer to Nathans, Oswald & Nimon, 2012).
But first things first: data cleaning and finding the optimal model.
The winpercent feature has already been transformed to the same range as the other two percentage features. No column has missing values (as indicated by DataFrame.info). The scales are binary and continuous and already exist as numerical values.
Feature conversion (like one-hot encoding) is therefore not required.
Feature scaling is most likely not required either, for two reasons: the binary features are already coded as 0/1, and the two percentile features are bounded between 0 and 1.
Yet, just to be on the safe side, we should review the general descriptive statistics:
# description of the percentage features
X_train[["sugarpercent", "pricepercent"]].describe()
| sugarpercent | pricepercent | |
|---|---|---|
| count | 63.000000 | 63.000000 |
| mean | 0.504968 | 0.471460 |
| std | 0.276090 | 0.271723 |
| min | 0.011000 | 0.011000 |
| 25% | 0.313000 | 0.279000 |
| 50% | 0.465000 | 0.465000 |
| 75% | 0.732000 | 0.651000 |
| max | 0.988000 | 0.976000 |
Visually, there is not much difference between the variables. Let's double check this with a statistical test:
# Levene test to check for equality of variances
stats.levene(X_train["sugarpercent"], X_train["pricepercent"])
LeveneResult(statistic=0.07720479900660633, pvalue=0.7815844866542734)
# t-test for mean differences:
stats.ttest_rel(X_train["sugarpercent"], X_train["pricepercent"])
Ttest_relResult(statistic=0.7899866018791739, pvalue=0.4325467109407668)
X_train.shape
(63, 11)
Statistically there is also no big difference, thus leaving the features in the original range.
Normally, an outlier analysis would follow; there are several reasons to skip this step here.
The idea is to use cross-validation to account for the small sample size.
This is combined with a grid search over different learners to find the best base learner for the model.
It is probably also better to use simpler learners (fewer parameters to tune) to reduce the risk of overfitting during the training steps.
# planned model calculations
models = {
"LinearRegression":{"class":LinearRegression(), "parameters":{"fit_intercept":[True,False]}},
"Ridge":{"class":Ridge(), "parameters":{"alpha":np.linspace(1,7,4),
"solver":["svd", "cholesky", "lsqr", "sag"]}},
"BayesianRidge":{"class":BayesianRidge(), "parameters":{
"alpha_1":np.linspace(1e-6, 3, 5),
"alpha_2":np.linspace(1e-6, 3, 5),
"lambda_1":np.linspace(1e-6, 3, 5),
"lambda_2":np.linspace(1e-6, 3, 5)
}},
"Lasso":{"class":Lasso(), "parameters":{"alpha":np.linspace(1,10,10)}},
"OrthogonalMatchingPursiut":{"class":OrthogonalMatchingPursuit(), "parameters":{
"n_nonzero_coefs":np.linspace(1,10,10).astype(int),
"fit_intercept":[True, False]
}},
"KernelRidge":{"class":KernelRidge(), "parameters":{
"alpha":np.linspace(1,10,10),
"kernel":["linear", "polynomial", "rbf", "sigmoid"]
}}
}
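`hp.grid_search` is a project helper whose source is not shown; presumably it loops a `GridSearchCV` over a dictionary like the one above. A hedged sketch of how it might be built (the exact column layout of the returned frame is an assumption):

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV

def grid_search(X, y, models, cv=3):
    """Run one GridSearchCV per learner and stack the cv_results_.

    Sketch of a possible hp.grid_search implementation; details
    of the real helper may differ.
    """
    frames = []
    for name, spec in models.items():
        search = GridSearchCV(spec["class"], spec["parameters"],
                              cv=cv, scoring="r2",
                              return_train_score=True, verbose=1)
        search.fit(X, y)
        frame = pd.DataFrame(search.cv_results_)
        frame["model"] = name  # remember which learner produced the row
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

With `verbose=1`, each search prints a "Fitting 3 folds for each of N candidates" line, which matches the log output below.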
# run learning phase
results= hp.grid_search(X_train, Y_train, models)
cleanScore = hp.make_clean(results)[
["model","params", "n_total",
"mean_test_score", "mean_train_score", "delta",
"adjusted_mean_test_score",
"adjusted_mean_train_score"]
].sort_values("mean_test_score", ascending=False)
# reset index
cleanScore.reset_index(inplace=True, drop=True)
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 16 candidates, totalling 48 fits
Fitting 3 folds for each of 625 candidates, totalling 1875 fits
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Fitting 3 folds for each of 40 candidates, totalling 120 fits
# top models
cleanScore.head()
| model | params | n_total | mean_test_score | mean_train_score | delta | adjusted_mean_test_score | adjusted_mean_train_score | |
|---|---|---|---|---|---|---|---|---|
| 0 | KernelRidge | {'alpha': 1.0, 'kernel': 'polynomial'} | 63 | 0.385501 | 0.631234 | 0.245732 | 0.252962 | 0.551696 |
| 1 | OrthogonalMatchingPursiut | {'fit_intercept': True, 'n_nonzero_coefs': 1} | 63 | 0.378243 | 0.420692 | 0.042449 | 0.244139 | 0.295744 |
| 2 | KernelRidge | {'alpha': 2.0, 'kernel': 'polynomial'} | 63 | 0.377099 | 0.564059 | 0.186960 | 0.242747 | 0.470033 |
| 3 | OrthogonalMatchingPursiut | {'fit_intercept': True, 'n_nonzero_coefs': 4} | 63 | 0.373003 | 0.572401 | 0.199397 | 0.237769 | 0.480173 |
| 4 | OrthogonalMatchingPursiut | {'fit_intercept': True, 'n_nonzero_coefs': 5} | 63 | 0.369872 | 0.602622 | 0.232749 | 0.233963 | 0.516912 |
We can already see three things:
So let's re-rank by sorting on the "delta" and dropping models with a test score below .2
# show learners with at least .2 test score. sorted by lowest delta
cleanScore[cleanScore["mean_test_score"] > .2].sort_values("delta", ascending=True).head()
| model | params | n_total | mean_test_score | mean_train_score | delta | adjusted_mean_test_score | adjusted_mean_train_score | |
|---|---|---|---|---|---|---|---|---|
| 1 | OrthogonalMatchingPursiut | {'fit_intercept': True, 'n_nonzero_coefs': 1} | 63 | 0.378243 | 0.420692 | 0.042449 | 0.244139 | 0.295744 |
| 352 | KernelRidge | {'alpha': 1.0, 'kernel': 'sigmoid'} | 63 | 0.232959 | 0.324273 | 0.091314 | 0.067519 | 0.178528 |
| 438 | KernelRidge | {'alpha': 7.0, 'kernel': 'polynomial'} | 63 | 0.200502 | 0.322654 | 0.122152 | 0.028061 | 0.176560 |
| 202 | BayesianRidge | {'alpha_1': 1e-06, 'alpha_2': 0.75000075, 'lam... | 63 | 0.264574 | 0.389878 | 0.125304 | 0.105953 | 0.258283 |
| 303 | KernelRidge | {'alpha': 6.0, 'kernel': 'polynomial'} | 63 | 0.245232 | 0.371114 | 0.125881 | 0.082439 | 0.235471 |
Interesting:
The second-place model (OrthogonalMatchingPursuit) shows up in first place this time. Let's plot it:
# show for each learner the best settings
scoreSorted = hp.sort_search_results(cleanScore)
hp.score_plot(scoreSorted[0], "mean_test_score", "delta", "params", "Delta sorted graphic (mean_test_score)")
hp.score_plot(scoreSorted[1], "mean_test_score", "delta", "params", "Test score sorted graphic (mean_test_score)")
Indeed, OrthogonalMatchingPursuit looks like the "best" learner compared to the others.
Although the model does not deliver dream scores, it's by far the best among the available ones.
But why don't we just take the model with the best training values (as the graph below shows for the linear regression models)?
hp.score_plot(scoreSorted[2], "mean_train_score", "delta", "params", "Train score sorted graphic (mean_train_score)")
The problem with such an approach is that we would choose a model that only pretends to be good. In fact, the linear regression scores only about 0.15 on unseen data (while the training score suggests a model fit of roughly 0.7).
This implies that a linear regression model is unlikely to provide any explanatory power.
The best strategy in this case is to rely on the model with the highest test score (among the available) and try to estimate the feature importance on it.
Let's proceed with the OrthogonalMatchingPursuit algorithm and check the final score on the hold-out data:
# check the final score on the validity set
# get selected model models
bestModel = cleanScore.loc[1, ["model", "params", "mean_test_score"]]
# check the chosen model
print(bestModel)
# extract params
bestParams = bestModel["params"]
model OrthogonalMatchingPursiut
params {'fit_intercept': True, 'n_nonzero_coefs': 1}
mean_test_score 0.378243
Name: 1, dtype: object
# print all param keys
print([key for key in bestParams])
# calculate model on complete train data
model = OrthogonalMatchingPursuit(
fit_intercept = bestParams.get("fit_intercept"),
n_nonzero_coefs=bestParams.get("n_nonzero_coefs")
)
model.fit(X_train, Y_train)
# calculate r2_score
y = model.predict(X_test)
y_train = model.predict(X_train)
score = r2_score(Y_test, y)
train_score = r2_score(Y_train, y_train)
print()
print("validity set score (true model score):", round(score,2))
print("train set score (for comparison):", round(train_score,2))
print("cross validity score:", bestModel["mean_test_score"].round(2))
['fit_intercept', 'n_nonzero_coefs']

validity set score (true model score): 0.37
train set score (for comparison): 0.42
cross validity score: 0.38
# fit model on complete dataset
model.fit(df[X_train.columns], df["winpercent"])
modelScoreComparison = model.score(df[X_train.columns], df["winpercent"])
print("Train score after using complete dataset: ", round(modelScoreComparison,2))
Train score after using complete dataset: 0.41
So, in the end, the cross validity score did not decrease compared to the hold-out data score (.38 vs. .37). Although the value is still not great, the low deviation between the two scores indicates a relatively robust model.
Let's take a look at the regression weights to get a first impression of the importance of each variable:
# weights
weights = pd.DataFrame(model.coef_, index = X_train.columns, columns = ["weights"]).T
weights["intercept"] = model.intercept_
display(weights.T)
| weights | |
|---|---|
| chocolate | 0.187793 |
| fruity | 0.000000 |
| caramel | 0.000000 |
| peanutyalmondy | 0.000000 |
| nougat | 0.000000 |
| crispedricewafer | 0.000000 |
| hard | 0.000000 |
| bar | 0.000000 |
| pluribus | 0.000000 |
| sugarpercent | 0.000000 |
| pricepercent | 0.000000 |
| intercept | 0.421423 |
As we can see, the most important aspect (by far) when choosing a candy is chocolate. Nevertheless, let's try to get a feature ranking.
The idea is to calculate the shapley values as a proxy for the importance. To do this, a dominance analysis is performed.
Dominance analysis is quite computationally expensive. This method calculates the degree of explanation for each possible feature combination and extracts the unique importance for each feature. Therefore, we need to run:
$$2^{p}-1$$
calculations (where p equals the number of features). In our case it is:
$$2^{11}-1 = 2047$$
The curse of the relatively small data set is at the same time a blessing, as it allows us to perform all the calculations quite quickly.
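`hp.dominance` is not shown here; conceptually, dominance analysis assigns each feature its Shapley value over the model's R². A hedged sketch using `sklearn.base.clone` so any base learner can be plugged in (the real helper's internals may differ):

```python
from itertools import combinations
from math import factorial

import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def dominance(X, y, model=None):
    """Shapley decomposition of the model's R^2 across features.

    Sketch of what hp.dominance might compute; it fits
    2**p - 1 submodels, one per non-empty feature subset.
    """
    if model is None:
        model = LinearRegression()
    cols = list(X.columns)
    p = len(cols)
    r2 = {frozenset(): 0.0}  # the empty model explains nothing
    for k in range(1, p + 1):
        for subset in combinations(cols, k):
            m = clone(model).fit(X[list(subset)], y)
            r2[frozenset(subset)] = m.score(X[list(subset)], y)
    # Shapley value of feature j: weighted average marginal R^2 gain
    shap = {}
    for j in cols:
        others = [c for c in cols if c != j]
        total = 0.0
        for k in range(p):
            w = factorial(k) * factorial(p - 1 - k) / factorial(p)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += w * (r2[s | {j}] - r2[s])
        shap[j] = total
    return shap
```

By the efficiency property of Shapley values, the per-feature scores sum exactly to the full model's R², which is the numerical check performed a few cells below.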
# check the model again
model
OrthogonalMatchingPursuit(n_nonzero_coefs=1)
# call the function
dominanceScore = hp.dominance(df[X_train.columns], df["winpercent"], model)
calculating R2 for each model
all R2s calculated. Building final dominance score.
All calculations done
# convert results into nicely formatted table
dominanceScore = pd.DataFrame(dominanceScore, index=["dominance score"])
# check numerical correctness (sum of all predictor scores has to be complete model score)
print("Sum of dominance scores: ", dominanceScore.sum(axis=1).round(5).values[0])
print("Model R2: ", round(modelScoreComparison,5))
print("Sum of dominance scores is model r2:",
dominanceScore.sum(axis=1).round(5).values[0] == round(modelScoreComparison,5))
Sum of dominance scores: 0.40515 Model R2: 0.40515 Sum of dominance scores is model r2: True
scores = (dominanceScore.T/dominanceScore.sum(axis=1)*100).round(2).sort_values("dominance score", ascending=False)
# drop negative values (can occur due to model mismatch when solely using this one predictor)
scores[scores["dominance score"] < 0] = 0
scores
| dominance score | |
|---|---|
| chocolate | 63.85 |
| bar | 9.47 |
| peanutyalmondy | 7.02 |
| fruity | 5.39 |
| pricepercent | 3.79 |
| crispedricewafer | 3.11 |
| hard | 2.73 |
| pluribus | 1.50 |
| sugarpercent | 1.23 |
| caramel | 1.03 |
| nougat | 0.89 |
As has been suggested many times (e.g., by correlations, etc.), chocolate is by far the most important predictor for candy success.
But we are not done yet. Let's check some more things.
One of the side questions raised is what type of candy to focus on: cookies or gummies.
Also, we should take into consideration which feature drives the price the most.
We will first analyze the preferred candy type.
Unfortunately, there is no explicit group/type membership indicator in the dataset.
But perhaps we can gain this information by splitting the dataset by the characteristics chocolate and fruity:
# make safe copy of the original dataset
tData = df.copy()
# add a group membership variable
tData["group"] = 0
for i, row in tData.iterrows():
if row["chocolate"] == 1:
if row["fruity"] == 1:
tData.loc[i, "group"] = 3
else:
tData.loc[i, "group"] = 1
else:
if row["fruity"] == 1:
tData.loc[i, "group"] = 2
else:
tData.loc[i, "group"] = 4
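As a side note, the `iterrows` loop above can also be written vectorized, e.g. with `np.select`. A sketch performing the same group assignment (the function name `assign_groups` is made up):

```python
import numpy as np
import pandas as pd

def assign_groups(frame):
    """Vectorized version of the group assignment loop above."""
    choco = frame["chocolate"] == 1
    fruit = frame["fruity"] == 1
    return np.select(
        [choco & ~fruit, ~choco & fruit, choco & fruit],
        [1, 2, 3],
        default=4,  # neither chocolate nor fruity
    )
```

On 85 rows the speed-up is irrelevant, but the vectorized form also removes the risk of index-alignment mistakes in the loop.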
# let's get an overview of the groups:
tData["group"].value_counts().sort_index().rename({1:"chocolate based", 2:"fruity based", 3:"contains both", 4:"neither"})
chocolate based    36
fruity based       37
contains both       1
neither            11
Name: group, dtype: int64
If I had to guess, the chocolate-based candies are cookies, the fruity ones are gummies, and the rest is something else. Let's do a quick web search and see if we are correct.
# display the first elements of each group
candyGroup = tData.groupby("group")
for key in candyGroup.groups.keys():
print("candygroup: ", key)
print("--------------------")
# get group
temp = candyGroup.get_group(key)
# check if we have more than 10 datapoints:
if temp.shape[0] > 10:
for _, row in temp.iloc[:5, :].iterrows():
hp.scrapping(row["competitorname"])
else:
hp.scrapping(temp.iloc[0,0])
candygroup: 1 -------------------- candy: 100 Grand
candy: 3 Musketeers
candy: Almond Joy
candy: Baby Ruth
candy: Charleston Chew
candygroup: 2 -------------------- candy: Air Heads
candy: Caramel Apple Pops
candy: Chewey Lemonhead Fruit Mix
candy: Chiclets
candy: Dots
candygroup: 3 -------------------- candy: Tootsie Pop
candygroup: 4 -------------------- candy: One dime
candy: One quarter
candy: Boston Baked Beans
candy: Candy Corn
candy: Haribo Happy Cola
Well, actually, these candies are more like: chocolate bars vs. gummies vs. blends vs. others (like gummies that are not fruity).
(Also, not every picture suits the competitor name perfectly, e.g. "One dime" and "Daim".)
Unfortunately, only two of these four types have a reasonable (and comparatively equal) sample size. So let's check which candy type - chocolate bars or gummies - is in general more preferred by consumers (if I had to guess again, I'd bet on the chocolate bar).
# t-test for chocolate bars vs. gummies
tTest = stats.ttest_ind(tData.loc[tData["group"] == 1, "winpercent"],
tData.loc[tData["group"] == 2, "winpercent"], equal_var=False)
tTest
Ttest_indResult(statistic=6.312215507955366, pvalue=2.493391944596787e-08)
# Graphical representation
go.Figure([
go.Bar(x=["chocolate bar", "gummies"],
y=[tData.loc[tData["group"] == 1, "winpercent"].mean(),
tData.loc[tData["group"] == 2, "winpercent"].mean()])
]).update_layout(title_text="Chocolate bar vs. gummies").show()
The chocolate bar candy type has a higher win rate by about 20 pp. on average.
This result is consistent with the importance analysis: chocolate-based products are preferred by consumers, so the chocolate bar should also be the preferred candy type.
Last but not least, we should check which product characteristic drives the price of the products the most:
# split the "pricepercent" from the rest of the train dataset
# --> -1 is the pricepercent variable
X_train_price=X_train.iloc[:,:-1]
Y_train_price=X_train.iloc[:,-1]
# do the same for the test set
X_test_price=X_test.iloc[:,:-1]
Y_test_price=X_test.iloc[:,-1]
# are the characteristics correlated with the new target?
# check correlations between features and the dependent variable
hp.find_corr(
["chocolate", "fruity", "caramel", "peanutyalmondy", "nougat","crispedricewafer", "hard", "bar", "pluribus"],
["sugarpercent"],
X_train_price,
Y_train_price
)
| chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | |
|---|---|---|---|---|---|---|---|---|---|---|
| correlation | 0.41 | -0.4 | 0.14 | 0.32 | 0.17 | 0.29 | -0.20 | 0.52 | -0.21 | 0.24 |
| p-value | 0.00 | 0.0 | 0.29 | 0.01 | 0.18 | 0.02 | 0.12 | 0.00 | 0.10 | 0.05 |
# due to the same problems as in the first analysis, we use the same model options:
results_price = hp.grid_search(X_train_price, Y_train_price, models)
cleanScore_price = hp.make_clean(results_price)[
["model","params", "n_total",
"mean_test_score", "mean_train_score", "delta",
"adjusted_mean_test_score",
"adjusted_mean_train_score"]
].sort_values("mean_test_score", ascending=False)
# reset index
cleanScore_price.reset_index(inplace=True, drop=True)
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 16 candidates, totalling 48 fits
Fitting 3 folds for each of 625 candidates, totalling 1875 fits
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Fitting 3 folds for each of 40 candidates, totalling 120 fits
# show results
cleanScore_price.head()
| model | params | n_total | mean_test_score | mean_train_score | delta | adjusted_mean_test_score | adjusted_mean_train_score | |
|---|---|---|---|---|---|---|---|---|
| 0 | KernelRidge | {'alpha': 4.0, 'kernel': 'polynomial'} | 63 | 0.153732 | 0.336783 | 0.183052 | -0.009012 | 0.209242 |
| 1 | KernelRidge | {'alpha': 3.0, 'kernel': 'polynomial'} | 63 | 0.153299 | 0.361969 | 0.208670 | -0.009528 | 0.239271 |
| 2 | KernelRidge | {'alpha': 5.0, 'kernel': 'polynomial'} | 63 | 0.148371 | 0.313863 | 0.165491 | -0.015403 | 0.181913 |
| 3 | KernelRidge | {'alpha': 2.0, 'kernel': 'polynomial'} | 63 | 0.142566 | 0.391967 | 0.249401 | -0.022325 | 0.275037 |
| 4 | KernelRidge | {'alpha': 6.0, 'kernel': 'polynomial'} | 63 | 0.139310 | 0.292123 | 0.152813 | -0.026208 | 0.155992 |
scoreSorted_price = hp.sort_search_results(cleanScore_price, .1)
hp.score_plot(scoreSorted_price[0], "mean_test_score", "delta", "params", "Delta sorted graphic (mean_test_score)")
# calculate model on complete train data
model_price = KernelRidge(
alpha = 8,
kernel="polynomial",
)
model_price.fit(X_train_price, Y_train_price)
# calculate r2_score
y_price = model_price.predict(X_test_price)
y_train_price = model_price.predict(X_train_price)
score_price = r2_score(Y_test_price, y_price)
train_score_price = r2_score(Y_train_price, y_train_price)
print()
print("validity set score (true model score):", round(score_price,2))
validity set score (true model score): 0.38
# looks better than first thought --> calculate model on complete dataset
# fit model on complete dataset
model_price.fit(df[X_train_price.columns], df["pricepercent"])
modelScoreComparison_price = model_price.score(df[X_train_price.columns], df["pricepercent"])
print("Train score after using complete dataset: ", round(modelScoreComparison_price,2))
Train score after using complete dataset: 0.36
# dominance score
dominanceScore_price = hp.dominance(df[X_train_price.columns], df["pricepercent"], model_price)
calculating R2 for each model
all R2s calculated. Building final dominance score.
All calculations done
# convert results into nicely formatted table
dominanceScore_price = pd.DataFrame(dominanceScore_price, index=["dominance score"])
scores_price = (dominanceScore_price.T/dominanceScore_price.sum(axis=1)*100).round(2).sort_values("dominance score", ascending=False)
# drop negative values (can occur due to model mismatch when solely using this one predictor)
scores_price[scores_price["dominance score"] < 0] = 0
scores_price
| dominance score | |
|---|---|
| bar | 29.93 |
| chocolate | 27.35 |
| sugarpercent | 18.18 |
| fruity | 7.97 |
| peanutyalmondy | 7.49 |
| crispedricewafer | 7.28 |
| caramel | 2.54 |
| hard | 1.02 |
| pluribus | 0.00 |
| nougat | 0.00 |
According to the analysis, the biggest drivers of the price are bar, chocolate & sugarpercent. In short, chocolate bar candies appear to be more expensive than the other candies. Yet, they are preferred more often.
Summing up all the information:
However, there are still unanswered questions that are important for a recommendation:
The production cost for the candy is not known. It is possible that chocolate-based products are not as profitable due to increased costs in manufacturing.
In addition, other situation-related questions are also unresolved - for example, consumption occasions. If gummy bears are consumed on far more occasions - even though they are not the preferred product - their total sales volume would still be higher.
Also, the data is still very limited. It is likely that a better model can be found based on more data alone.
The following steps are therefore recommended: